Develop a storyline that captures attention and maintains interest.
Your audience is your peers.
Clearly state the problem or question you’re addressing.
Explain why it is relevant.
Provide an overview of your approach.
Example of writing including citing references:
This is an introduction to ….. regression, a non-parametric estimator of the conditional expectation of one random variable given another. The goal of kernel regression is to discover the non-linear relationship between two random variables. Kernel estimation, or kernel smoothing, is the main method for estimating such a curve in non-parametric statistics. In a kernel estimator, the weight function is known as the kernel function (efr2008?). Cite this paper (bro2014principal?). The GEE (wang2014?). The PCA (daffertshofer2004pca?)
This is my work and I want to add more work…
Methods
The Frequentist Framework
Linear regression can be carried out in a variety of ways; the two of interest here are the frequentist and Bayesian approaches. The frequentist approach is the more familiar one. It estimates the effects of independent variables (predictors) on a dependent variable (the outcome). Each regression coefficient is a point estimate of a parameter assumed to have a fixed value. The frequentist linear model is:
\[
Y = \beta_0 + \beta_1X + \varepsilon \tag{1}
\]
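As a minimal sketch of the frequentist fit in Equation (1), the following R code simulates a small data set and recovers the point estimates with lm(). The sample size, coefficients, and noise level are illustrative assumptions, not values from the flight data. The closed-form least-squares slope is computed alongside to show that each estimate is a single fixed number.

```r
# Illustrative simulation; n, beta0, beta1 and the noise sd are assumed values.
set.seed(1)
n     <- 500
beta0 <- 2
beta1 <- 0.5
x     <- runif(n, 0, 100)
y     <- beta0 + beta1 * x + rnorm(n, mean = 0, sd = 5)

# Frequentist fit: each coefficient is a single point estimate.
fit <- lm(y ~ x)
coef(fit)

# The same intercept and slope from the least-squares formulas.
slope_hat     <- cov(x, y) / var(x)
intercept_hat <- mean(y) - slope_hat * mean(x)
c(intercept_hat, slope_hat)
```

The two sets of numbers agree up to floating-point error, which is one way to see that lm() returns a point estimate rather than a distribution.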
The Bayesian approach estimates the relationship between predictors and an outcome in a similar way; however, its regression coefficient is not a point estimate but a distribution. That is, the regression coefficient is not assumed to be a fixed value. The Bayesian approach also goes a step further than frequentist regression in its inclusion of prior knowledge. The Bayesian approach is so named because it is based on Bayes’ rule, which is written as follows:
\[
p(\theta|y) = \frac{L(\theta|y)\,p(\theta)}{p(y)}
\]
The normalization constant (\(p(y)\) above) ensures the posterior distribution is a valid distribution, but the posterior density function can be written without this constant. The resulting prediction is not a point estimate, but a distribution (Bayes 1763). The Bayesian approach is derived with Bayes’ theorem, wherein the posterior distribution, the updated belief about the parameter given the data \(p(\theta|y)\), is proportional to the product of the likelihood of \(\theta\) given \(y\), \(L(\theta|y)\), and the prior density of \(\theta\), \(p(\theta)\). The former is known as the likelihood function and comprises the new data for analysis, while the latter allows for the incorporation of prior knowledge regarding \(\theta\) (Yan and Su 2009).
To generate a model for our analysis, we start with the normal data model \(Y_i|\beta_0, \beta_1, \sigma \sim N(\mu, \sigma^2)\) and replace \(\mu\) with a mean \(\mu_i\) that is specific to the value of our predictor, departure time. The model is:
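The proportionality between the posterior and the product of likelihood and prior can be illustrated numerically. The sketch below uses a grid approximation for the mean \(\theta\) of a normal sample; the data, the known standard deviation of 2, and the \(N(0, 5^2)\) prior are all assumptions chosen for illustration and are unrelated to the flight data.

```r
# Illustrative data: 20 draws from a normal with unknown mean theta (sd assumed known).
set.seed(2)
y <- rnorm(20, mean = 3, sd = 2)

# Grid of candidate values for theta.
theta <- seq(-5, 10, length.out = 1501)
step  <- theta[2] - theta[1]

# Prior p(theta) and likelihood L(theta | y) evaluated on the grid.
prior      <- dnorm(theta, mean = 0, sd = 5)
likelihood <- sapply(theta, function(t) prod(dnorm(y, mean = t, sd = 2)))

# Unnormalized posterior: likelihood times prior.
unnormalized <- likelihood * prior

# Dividing by a numerical approximation of p(y) makes the density integrate to 1.
p_y       <- sum(unnormalized) * step
posterior <- unnormalized / p_y

# The posterior mean is pulled slightly from the sample mean toward the prior mean.
post_mean <- sum(theta * posterior) * step
c(prior_mean = 0, sample_mean = mean(y), posterior_mean = post_mean)
```

Dropping the division by p_y changes only the vertical scale of the curve, not its shape, which is why the posterior is often written without the constant.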
\[
\begin{align*}
Y_i|\beta_0, \beta_1, \sigma &\overset{\text{ind}}{\sim} N (\mu_i, \sigma^2) && \text{with } && \mu_i = \beta_0 + \beta_1X_i
\end{align*}
\]
Where:
- \(Y_i\) is the arrival delay for the i-th flight
- \(X_i\) is the departure time for the i-th flight
- \(\mu_i = \beta_0 + \beta_1X_i\) is the local mean arrival delay, specific to the departure time
- \(\sigma^2\) is the variance of the errors
- \(\overset{\text{ind}}{\sim}\) indicates conditional independence of each arrival delay given the parameters
Prior Selection
This analysis will explore the differences in Bayesian linear regression using flat priors and tuned priors.
Since we are using only two data variables, arrival delay and departure time, the regression parameters will be \(\beta_0\), \(\beta_1\), and \(\sigma\) for the intercept, slope, and error. As the intercept and slope parameters can take any real value, we will use normal prior models (Johnson, Ott, and Dogucu 2022).
\[
\begin{align*}
\beta_0 &\sim N(m_0, s^2_0)\\
\beta_1 &\sim N(m_1, s^2_1)
\end{align*}
\]
where \(m_0, s_0, m_1, \text{and } s_1\) are hyperparameters.
The standard deviation parameter must be positive, so we will use an exponential model (Johnson, Ott, and Dogucu 2022).
\[
\sigma \sim \text{Exp}(l)
\]
Because the exponential model is a special case of the Gamma model with shape \(s = 1\), we can use the definitions of the mean and variance of the Gamma model to find those of the exponential model (Johnson, Ott, and Dogucu 2022).
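For reference, this step can be written out using the shape-rate parameterization of the Gamma model with shape \(s\) and rate \(r\) (the symbol \(r\) is introduced here for illustration only):

\[
\begin{align*}
\sigma \sim \text{Gamma}(s, r): \quad & E(\sigma) = \frac{s}{r}, && \text{Var}(\sigma) = \frac{s}{r^2}\\
\sigma \sim \text{Exp}(l) = \text{Gamma}(1, l): \quad & E(\sigma) = \frac{1}{l}, && \text{SD}(\sigma) = \frac{1}{l}
\end{align*}
\]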
summary(Delays_sample$DEP_TIME_MINS) # mean departure time is 809.3 minutes (~ 1:30pm)
Delays_sample_filtered_B0 <- subset(Delays_sample, DEP_TIME_MINS >= 800 & DEP_TIME_MINS <= 820)
mean(Delays_sample_filtered_B0$ARR_DELAY) # m_0c = 2
sd(Delays_sample_filtered_B0$ARR_DELAY) # s_0c = 36
\(\beta_{0c}\) reflects the typical arrival delay at a typical departure time. At the mean departure time of \(\sim\) 1:30pm, the average arrival delay is \(\sim\) 2 minutes with a standard deviation of \(\sim\) 36 minutes.
The slope of the linear model indicates a 0.019 minute increase in arrival delay per minute increase in departure time, so we set \(m_1 = 0.02\). The standard error of 0.0005 reflects high confidence, but so as not to over-constrain the model we set a wider prior standard deviation of \(s_1 = 0.01\).
\[
\beta_{1} \sim N(0.02, 0.01^2)
\]
\(\sigma\) is the model standard deviation, which we inform with the residual standard error of the linear model:
summary(lm_model)$sigma
To tune the exponential model, we set the expected value of the standard deviation, \(E(\sigma)\), equal to the residual standard error, \(\sim 50\). Because \(E(\sigma) = 1/l\) for an exponential model, this determines the rate parameter, \(l\).
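The arithmetic can be sketched in a few lines of R. The value 50 is the approximate residual standard error quoted in the text; the simulation check and its seed are additions for illustration.

```r
# Residual standard error reported in the text (approximately 50 minutes).
resid_se <- 50

# For sigma ~ Exp(l), E(sigma) = 1 / l, so matching the mean gives l = 1 / E(sigma).
l <- 1 / resid_se
l  # 0.02

# Check by simulation: draws from Exp(l) should average close to 50.
set.seed(3)
sigma_draws <- rexp(100000, rate = l)
mean(sigma_draws)
```

Under this assumption the tuned prior would be \(\sigma \sim \text{Exp}(0.02)\).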
Data were collected by the Bureau of Transportation Statistics (BTS) and accessed through a dataset compiled by Patrick Zelazko (Zelazko 2023). The data were imported into R (R Core Team 2023) from a CSV file. This is a large time-series dataset with 3 million observations, each a specific flight, and 32 features. The data cover flights within the United States from 2019 through 2023. Diverted and cancelled flights are recorded, as are the delay time in minutes and the attributed reasons for delay.
The stan_glm() function from the “rstanarm” library (Brilleman et al. 2018) was used to simulate the Normal Bayesian linear regression model. This function runs the Markov chain Monte Carlo simulation with a specified number of chains and iterations and an optional seed; here these were set to 4 chains, 2000 iterations, and a seed of 123. Posterior predictions were simulated with the posterior_predict() function, also from the “rstanarm” library (Brilleman et al. 2018). The model was evaluated by considering the data and its source, the assumptions of the model, and the accuracy of the predictions. The posterior predictions were evaluated with the prediction_summary() function from the “bayesrules” library (Dogucu, Johnson, and Ott 2021). This provides the median absolute error (MAE), the scaled MAE, and the proportion of observed values that fall within the 50% and 95% posterior prediction intervals.
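A minimal sketch of how these calls could fit together is given below. It is not the original analysis script: the data frame delays_sim is a simulated stand-in that borrows the column names ARR_DELAY and DEP_TIME_MINS from the text, the rate 0.02 assumes \(E(\sigma) = 50\), and the code is skipped when the packages are not installed. In rstanarm, prior_intercept is interpreted with the predictors centered, which matches the centered intercept \(\beta_{0c}\).

```r
# Stand-in data: simulated values using the column names from the text.
# The real analysis uses Delays_sample; these numbers are purely illustrative.
set.seed(123)
delays_sim <- data.frame(DEP_TIME_MINS = round(runif(200, 300, 1380)))
delays_sim$ARR_DELAY <- 2 + 0.02 * (delays_sim$DEP_TIME_MINS - 809) +
  rnorm(200, mean = 0, sd = 50)

if (requireNamespace("rstanarm", quietly = TRUE)) {
  # Normal regression with the tuned priors described in the text.
  # exponential(0.02) assumes E(sigma) = 50, i.e. l = 1 / 50.
  fit <- rstanarm::stan_glm(
    ARR_DELAY ~ DEP_TIME_MINS,
    data = delays_sim,
    family = gaussian,
    prior_intercept = rstanarm::normal(2, 36),
    prior = rstanarm::normal(0.02, 0.01),
    prior_aux = rstanarm::exponential(0.02),
    chains = 4, iter = 2000, seed = 123, refresh = 0
  )

  # Posterior predictive draws: one row per MCMC draw, one column per flight.
  preds <- rstanarm::posterior_predict(fit, newdata = delays_sim)
  dim(preds)

  if (requireNamespace("bayesrules", quietly = TRUE)) {
    # MAE, scaled MAE, and coverage of the 50% and 95% prediction intervals.
    print(bayesrules::prediction_summary(model = fit, data = delays_sim))
  }
}
```

Setting the seed makes the MCMC draws reproducible, so the summary statistics stay the same from one run to the next on a given setup.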
Possibly k-fold cross validation, model averaging
Analysis and Results
Exploratory Data Analysis
Following are the definitions of the given variables in this dataset.
| Header | Description |
|---|---|
| Fl Date | Flight Date (yyyy-mm-dd) |
| Airline | Airline Name |
| Airline DOT | Airline Name and Unique Carrier Code. When the same code has been used by multiple carriers, a numeric suffix is used for earlier users, for example, PA, PA(1), PA(2). Use this field for analysis across a range of years. |
| Airline Code | Unique Carrier Code |
| DOT Code | An identification number assigned by US DOT to identify a unique airline (carrier). A unique airline (carrier) is defined as one holding and reporting under the same DOT certificate regardless of its Code, Name, or holding company/corporation. |
| Fl Number | Flight Number |
| Origin | Origin Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused. |
| Origin City | Origin City Name, State Code |
| Dest | Destination Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused. |
| Dest City | Destination City Name, State Code |
| CRS Dep Time | CRS Departure Time (local time: hhmm) |
| Dep Time | Actual Departure Time (local time: hhmm) |
| Dep Delay | Difference in minutes between scheduled and actual departure time. Early departures show negative numbers. |
| Taxi Out | Taxi Out Time, in Minutes |
| Wheels Off | Wheels Off Time (local time: hhmm) |
| Wheels On | Wheels On Time (local time: hhmm) |
| Taxi In | Taxi In Time, in Minutes |
| CRS Arr Time | CRS Arrival Time (local time: hhmm) |
| Arr Time | Actual Arrival Time (local time: hhmm) |
| Arr Delay | Difference in minutes between scheduled and actual arrival time. Early arrivals show negative numbers. |
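Several of these fields store clock times as hhmm, while the modeling code refers to departure time in minutes (DEP_TIME_MINS). The original derivation of that column is not shown; the sketch below is one plausible conversion, and hhmm_to_minutes is a hypothetical helper name rather than a function from the analysis.

```r
# Convert an hhmm clock time (e.g. 1329 for 1:29pm) to minutes after midnight.
# hhmm_to_minutes is a hypothetical helper, not a function from the original analysis.
hhmm_to_minutes <- function(hhmm) {
  hours   <- hhmm %/% 100
  minutes <- hhmm %% 100
  hours * 60 + minutes
}

hhmm_to_minutes(c(5, 930, 1329, 2359))
# returns 5 570 809 1439
```

Working in minutes after midnight keeps the predictor on a continuous scale, whereas raw hhmm values jump by 41 between, say, 1259 and 1300.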
Table 1: Summary includes morning, afternoon, and evening flight periods.
Each of the three flight periods is an 8-hour segment (the Morning period covers departure times from 4am to noon, followed by Afternoon and Evening). The Afternoon period contains the most flights (47.4%), followed closely by the Morning period (41.5%), while the Evening period trails both (11%). The table also gives the mean departure and arrival times, which indicate where flights are concentrated within each period. The average departure and arrival delays are lowest for the Morning period (5.23 and -0.77 minutes), with delays increasing through the Afternoon and Evening periods. The delay counts by type show that the Afternoon and Morning periods account for substantially more of the total delays, though this does not account for the smaller share of flights in the Evening period.
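The period assignment can be sketched as a small function of the departure hour. Only the Morning window (4am to noon) is stated in the text; the Afternoon and Evening boundaries below, and the handling of the exact boundary hours, are assumptions that follow from three consecutive 8-hour segments. The name flight_period is hypothetical.

```r
# Assign a flight period from the departure hour (0-23).
# Assumed boundaries: Morning 04:00-11:59, Afternoon 12:00-19:59, Evening 20:00-03:59.
flight_period <- function(dep_hour) {
  ifelse(dep_hour >= 4 & dep_hour < 12, "Morning",
         ifelse(dep_hour >= 12 & dep_hour < 20, "Afternoon", "Evening"))
}

flight_period(c(2, 4, 11, 12, 19, 20, 23))
# returns "Evening" "Morning" "Morning" "Afternoon" "Afternoon" "Evening" "Evening"
```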
Some Visualizations of the Dataset
These histograms illustrate the frequencies of air time, arrival delays, and departure delays. The y-axis was transformed to make the visualizations more legible. All show a skew to the right. This makes sense for air times with a higher proportion of regional flights and the exclusion of international departures and arrivals. Shorter delays (for both arrivals and departures) being more frequent than longer delays is also to be expected.
This visualization shows the average arrival delay for the largest five airlines (filtered for carriers with over 200,000 flights in the given period). The standard deviations for these airlines are fairly small, indicating a low variability in the arrival delays for these airlines.
This heat map shows the average arrival delay for flights by origin airport. This comes from the idea that if a flight is delayed at departure, then it may also be delayed on arrival at its destination.
Testing between each of AIRLINE_CODE, ORIGIN, DEST, DELAY_DUE_CARRIER, DELAY_DUE_WEATHER, DELAY_DUE_NAS, DELAY_DUE_SECURITY, DELAY_DUE_LATE_AIRCRAFT, CRS_DEP_HOUR, DEP_HOUR, WHEELS_OFF_HOUR, WHEELS_ON_HOUR, CRS_ARR_HOUR, ARR_HOUR, and FLIGHT_PERIOD and each of DEP_DELAY, TAXI_OUT, TAXI_IN, ARR_DELAY, CRS_ELAPSED_TIME, ELAPSED_TIME, AIR_TIME, and DISTANCE.
Modeling and Results
Explain your data preprocessing and cleaning steps.
Present your key findings in a clear and concise manner.
Use visuals to support your claims.
Tell a story about what the data reveals.
Conclusion
Summarize your key findings.
Discuss the implications of your results.
References
Bayes, T. 1763. “An Essay Towards Solving a Problem in the Doctrine of Chances.”
Brilleman, SL, MJ Crowther, M Moreno-Betancur, J Buros Novik, and R Wolfe. 2018. “Joint Longitudinal and Time-to-Event Models via Stan.” https://github.com/stan-dev/stancon_talks/.
Dogucu, Mine, Alicia Johnson, and Miles Ott. 2021. Bayesrules: Datasets and Supplemental Functions from Bayes Rules! Book. https://github.com/bayes-rules/bayesrules.
Johnson, Alicia A, Miles Q Ott, and Mine Dogucu. 2022. Bayes Rules!: An Introduction to Bayesian Modeling with R. Chapman & Hall.
Lesaffre, Emmanuel, and Andrew B Lawson. 2012. Bayesian Biostatistics. 1st ed. Somerset: John Wiley & Sons, Ltd. https://doi.org/10.1002/9781119942412.
R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.